[Reporting][DataFiltering] Expand data padding to full simulation range#2867
[Reporting][DataFiltering] Expand data padding to full simulation range#2867
Conversation
|
Current Coverage: 99% Mypy errors on expand-data-fix-2 branch: 1213 |
|
Current Coverage: 99% Mypy errors on expand-data-fix-2 branch: 1213 |
|
Current Coverage: 99% Mypy errors on expand-data-fix-2 branch: 1213 |
|
Current Coverage: 99% Mypy errors on expand-data-fix-2 branch: 1213 |
|
Current Coverage: 99% Mypy errors on expand-data-fix-2 branch: 1213 |
allisterakun
left a comment
There was a problem hiding this comment.
LGTM! Thank you Niko! Great improvements!
There was a problem hiding this comment.
I would like to advocate that pandas.NA be used as the default "fill_value" instead of numpy.NaN. From a data science perspective NaN (not a number) and NA (missing value) are meaningfully different: the former indicates that an operation produces a nonsensical result (e.g., division by zero) but the latter indicates that a value was not observed or measured. In this case, the fill value is meant to indicate that the variable in question was not measured at this time step, so NA is the more appropriate value.
Also, I think that the default behavior for non-aggregated data should be to fill gaps for each time step in the model. My reasoning for this is that it retains the full time series, which contains information about when variables were measured and when they weren't. It also ensures that multiple variables line up, which is crucial for any data analysis or comparison. Ultimately, I think that this behavior should be built into the output manager, but padding the data by default in the interim is a fine bridge. I'm happy to discuss these points in more detail at any time.
edit: I suppose an empty cell ("") would also be acceptable in the CSV output instead of pandas.NA. That is typically how excel denotes missing values. But explicit would be better IMO.
|
To also update from conversations yesterday: in testing this I found the use of regex and dictionaries caused this functionality to break and stop the simulation. @ew3361zh is investigating further! |
Whew, after some investigation, I think our culprit is indeed caused when the filtering process finds a variable that was added to OM as a dictionary. For example in nutrient_amounts = {
"dm": average_nutrition_supply.dry_matter,
"CP": average_nutrition_supply.crude_protein,
"ADF": average_nutrition_supply.adf_supply,
"NDF": average_nutrition_supply.ndf_supply,
"lignin": average_nutrition_supply.lignin_supply,
"ash": average_nutrition_supply.ash_supply,
"phosphorus": average_nutrition_supply.phosphorus * GeneralConstants.GRAMS_TO_KG,
"potassium": average_nutrition_supply.potassium_supply,
"N": average_nutrition_supply.nitrogen_supply,
"as_fed": average_nutrition_supply.wet_matter,
"EE": average_nutrition_supply.fat_supply,
"starch": average_nutrition_supply.starch_supply,
"TDN": average_nutrition_supply.tdn_supply,
"DE": average_nutrition_supply.digestible_energy_supply,
"calcium": average_nutrition_supply.calcium * GeneralConstants.GRAMS_TO_KG,
"fat": average_nutrition_supply.fat_supply * GeneralConstants.KG_TO_GRAMS,
"fat_percentage": average_nutrition_supply.fat_percentage,
"forage_ndf": average_nutrition_supply.forage_ndf_supply,
"forage_ndf_percent": average_nutrition_supply.forage_ndf_percentage,
"ME": average_nutrition_supply.metabolizable_energy,
"NE_maintenance_and_activity": average_nutrition_supply.maintenance_energy,
"NE_lactation": average_nutrition_supply.lactation_energy,
"NE_growth": average_nutrition_supply.growth_energy,
"metabolizable_protein": average_nutrition_supply.metabolizable_protein,
}
om.add_variable(f"ration_nutrient_amount_for_{pen_base_name}", nutrient_amounts, info_map)The padding process uses the fill value as is which then produces an issue when trying to save this to a CSV. So take this filter which works on dev: {
"name": "Farmgrown Feed DMI total, kg",
"filters": [
"FeedManager._log_feed_deductions.farmgrown_feed_.*_fed"
],
"expand_data": true,
"use_fill_value_in_gaps": true,
"use_fill_value_at_end": true,
"fill_value": 0
}The reason it works on dev but NOT the feature branch is that on dev we don't have any filling happening AFTER the last piece of data reported to OM whereas on the feature branch, it's filling in one Essentially on the feature branch we have a list of dictionaries for each value in that parent variable dict that look like: The fill value set in the filter is So right now This to me feels like a separate issue is needed for it rather than trying to tackle it here. I can make it a higher priority to make sure it gets fixed quickly. Let me know if that sounds ok reviewers @JoeWaddell @allisterakun @morrowcj |
I think this makes sense to me. I can implement here.
So you're saying the default behavior should be |
That sounds OK to me.
Yes - at least for all report files that contain any raw (non-aggregated) or daily data. Though, I too am interested in the others' perspectives. |
|
Current Coverage: 99% Mypy errors on expand-data-fix-2 branch: 1215 |
|
Current Coverage: 99% Mypy errors on expand-data-fix-2 branch: 1215 |
|
Current Coverage: 99% Mypy errors on expand-data-fix-2 branch: 1212 |
I think across the board the default assumption is that output variables are reported on a daily basis and only once unless otherwise stated (see all the ration interval reporting that currently happens). I'll need to think a bit more about how we'd want to handle those cases, and what other cases might be exceptions to the general rule of one output per day. |
We discussed this a little at the dev team wt today and I think there's some concern at users not understanding that padding has occurred if it is the default behavior. This could lead to some potential misinterpretation of the data produced depending on how it looks. So I think if we want to go this route it has to be fully understandable to new users (and experienced users used to a certain behavior). Would NaNs as padding be intuitive enough? I'm not sure. |
|
Current Coverage: 99% Mypy errors on expand-data-fix-2 branch: 1212 |
New default behavior for data padding is to pad data across the entire simulation length.
Context
Issue(s) closed by this pull request: closes #2725
What
use_fill_value_before_start.Why
Repeated confusion from users on data not being padded as expected if the reporting frequency is not daily.
How
Utilityexpand_data_temporally()is to now pad the data for the full length of the simulation. This will allow any data reported to more easily match the dataset length of the variables reported daily.expand_data_to_observed_rangefilter option set totrue.Test plan
I tried making an exhaustive filter to see the behavior:
{ "multiple": [ { "name": "Expand to Observed (Info Maps) Range - no fill value", "filters": [ "Feed.Silage.Bunker.corn_silage_storage_1.total_dry_matter_mass" ], "expand_data": true, "expand_data_to_observed_range": true, "use_fill_value_before_start": false, "use_fill_value_in_gaps": false, "use_fill_value_at_end": false }, { "name": "Expand full simulation - fill everywhere", "filters": [ "Feed.Silage.Bunker.corn_silage_storage_1.total_dry_matter_mass" ], "expand_data": true, "fill_value": "FILL", "use_fill_value_before_start": true, "use_fill_value_in_gaps": true, "use_fill_value_at_end": true }, { "name": "Expand full simulation - fill everywhere - no fill value provided", "filters": [ "Feed.Silage.Bunker.corn_silage_storage_1.total_dry_matter_mass" ], "expand_data": true, "use_fill_value_before_start": true, "use_fill_value_in_gaps": true, "use_fill_value_at_end": true }, { "name": "Expand full simulation - use carried values everywhere", "filters": [ "Feed.Silage.Bunker.corn_silage_storage_1.total_dry_matter_mass" ], "expand_data": true, "use_fill_value_before_start": false, "use_fill_value_in_gaps": false, "use_fill_value_at_end": false }, { "name": "Expand full simulation - fill start only", "filters": [ "Feed.Silage.Bunker.corn_silage_storage_1.total_dry_matter_mass" ], "expand_data": true, "fill_value": "FILL", "use_fill_value_before_start": true, "use_fill_value_in_gaps": false, "use_fill_value_at_end": false }, { "name": "Expand full simulation - fill gaps only", "filters": [ "Feed.Silage.Bunker.corn_silage_storage_1.total_dry_matter_mass" ], "expand_data": true, "fill_value": "FILL", "use_fill_value_before_start": false, "use_fill_value_in_gaps": true, "use_fill_value_at_end": false }, { "name": "Expand full simulation - fill end only", "filters": [ "Feed.Silage.Bunker.corn_silage_storage_1.total_dry_matter_mass" ], "expand_data": true, "fill_value": "FILL", "use_fill_value_before_start": false, "use_fill_value_in_gaps": false, "use_fill_value_at_end": true }, { "name": "No expansion", "filters": [ "Feed.Silage.Bunker.corn_silage_storage_1.total_dry_matter_mass" ] } ] }Input Changes
Output Changes
Filter